Machine Learning Systems Design
Note
Most of the notes are from the book Machine Learning Systems Design
Overview of Machine Learning Systems Design
MLOps
MLOps = Machine Learning Operations
- Ops in MLOps comes from DevOps, short for Development and Operations.
- To operationalize something means to bring it into production, which includes deploying, monitoring, and maintaining it.
- MLOps is a set of tools and best practices for bringing ML into production.
ML in research vs. ML in production
These concerns matter more for ML in production than for ML in research:
- stakeholder involvement
- computational priority
- the properties of data used
- the gravity of fairness issues
- the requirements for interpretability
Requirements for ML Systems
Reliability
Scalability
- resource scaling
- artifact management
Maintainability
Adaptability
- data distribution shifts
- Continual Learning
Iterative Process of Developing an ML system
- Project scoping
- Data engineering
- ML model development
- ML Model Deployment
- ML System Monitoring and Continual learning
- Business analysis
Infrastructure and Tooling for MLOps
Infrastructure
Storage & Compute
- storage layer: where data is collected and stored
- hard drive disk (HDD), solid state disk (SSD)
- AWS Storage & Databases#Amazon Simple Storage Service (Amazon S3), Snowflake
- compute layer: all the compute resources a company has access to and the mechanism to determine how these resources can be used
- CPU, GPU
- Amazon EC2, GCP
Development Environment
- IDE (Integrated development environment)
- cloud IDE: AWS Cloud9, Amazon SageMaker Studio
- versioning
- CI/CD
Resource Management
- Cron, Schedulers, Orchestrators
- cron: runs a script at a predetermined time and reports whether the job succeeded or failed (see the sketch after this list)
- scheduler & orchestrator
- scheduler: cron programs that decide when jobs run
- orchestrator: cron programs concerned with where to get the resources to run the jobs, e.g. Kubernetes
- in practice, scheduler and orchestrator are often used interchangeably
- Data science workflow management
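A minimal sketch of what cron-style scheduling does, assuming a hypothetical `train.py` script; the crontab line and the Python loop below are illustrative only, not part of any specific tool mentioned in these notes.

```python
# Equivalent crontab entry (run train.py every day at 02:00; cron reports exit status):
#   0 2 * * * /usr/bin/python /opt/ml/train.py
#
# A hand-rolled stand-in for cron: run a script at a fixed time and report
# whether the job succeeded or failed. "train.py" is a hypothetical script.
import subprocess
import time
from datetime import datetime, timedelta

def run_daily(script: str, hour: int = 2) -> None:
    while True:
        now = datetime.now()
        next_run = now.replace(hour=hour, minute=0, second=0, microsecond=0)
        if next_run <= now:
            next_run += timedelta(days=1)
        time.sleep((next_run - now).total_seconds())  # wait until the scheduled time
        result = subprocess.run(["python", script])   # run the job
        status = "succeeded" if result.returncode == 0 else "failed"
        print(f"{datetime.now().isoformat()} job {script} {status}")

# run_daily("train.py")  # blocks forever, like a cron daemon
```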
Pipeline Orchestration
Orchestration enables end-to-end traceability of a pipeline by using automation to capture the specific inputs, outputs, and artifacts of each task.
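A toy sketch of what an orchestrator records, using plain Python functions as tasks; the task names and record fields are illustrative assumptions, not any particular orchestration tool's API.

```python
# Toy pipeline "orchestrator": run tasks in order and capture the inputs and
# outputs of each task so the run is traceable end to end.
from typing import Any, Callable

def run_pipeline(tasks: list[tuple[str, Callable[[Any], Any]]], initial_input: Any):
    trace = []            # one record per task
    data = initial_input
    for name, task in tasks:
        inputs = data
        data = task(inputs)                      # output of one task feeds the next
        trace.append({"task": name, "inputs": inputs, "outputs": data})
    return data, trace

# Illustrative tasks (names are assumptions, not from the notes)
def ingest(_):
    return [1.0, 2.0, 3.0]

def preprocess(rows):
    return [r * 2 for r in rows]

def train(features):
    return {"model": "dummy", "mean_feature": sum(features) / len(features)}

output, trace = run_pipeline(
    [("ingest", ingest), ("preprocess", preprocess), ("train", train)], None
)
for record in trace:
    print(record)
```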
Model lineage
- for each version of a trained model, record the versions of (see the example record below):
- data used
- code/hyperparameters used
- algorithm/framework
- training docker image
- packages/libraries
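A sketch of the lineage metadata one might record per trained model version; every field name and value below is an illustrative placeholder, not taken from a real system.

```python
# Illustrative lineage record for one trained model version.
# All values are placeholders; real lineage tooling fills these in automatically.
model_lineage = {
    "model_version": "1.3.0",
    "data_version": "s3://my-bucket/training-data/v42/",   # hypothetical data path
    "code_version": "git:9f1c2ab",                          # commit of the training code
    "hyperparameters": {"learning_rate": 1e-3, "batch_size": 64},
    "algorithm_framework": "xgboost==1.7.6",
    "training_image": "1234.dkr.ecr.us-east-1.amazonaws.com/train:2.1",  # hypothetical image
    "packages": ["numpy==1.26.4", "pandas==2.2.2"],
}
```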
Model registry
- centrally manage model metadata and model artifacts
- track which models are deployed across environments
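A minimal in-memory sketch of a model registry, just to show the idea of centrally managing model metadata and tracking which version is deployed to which environment; this is a generic illustration, not any particular registry's API.

```python
# Minimal in-memory model registry: stores metadata per model version and
# tracks which version is deployed in each environment.
class ModelRegistry:
    def __init__(self):
        self.versions = {}     # (name, version) -> metadata dict
        self.deployments = {}  # (name, environment) -> version

    def register(self, name: str, version: str, metadata: dict) -> None:
        self.versions[(name, version)] = metadata

    def deploy(self, name: str, version: str, environment: str) -> None:
        if (name, version) not in self.versions:
            raise KeyError(f"{name}:{version} is not registered")
        self.deployments[(name, environment)] = version

    def deployed_version(self, name: str, environment: str) -> str:
        return self.deployments[(name, environment)]

registry = ModelRegistry()
registry.register("churn-model", "1.0.0", {"artifact": "s3://models/churn/1.0.0/"})
registry.deploy("churn-model", "1.0.0", "staging")
print(registry.deployed_version("churn-model", "staging"))  # -> 1.0.0
```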
Artifact tracking
- artifact = the output of a step or task; it can be consumed by the next step in a pipeline or deployed directly for consumption
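A small sketch of artifact tracking: fingerprint an output file so downstream steps or a deployment can reference exactly that output. The file path and step name are hypothetical.

```python
# Record an artifact produced by a pipeline step: its location and a content
# hash, so the next step (or a deployment) can reference exactly this output.
import hashlib
from pathlib import Path

def track_artifact(path: str, produced_by: str) -> dict:
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return {"path": path, "sha256": digest, "produced_by": produced_by}

# Usage (hypothetical file produced by a "train" step):
# record = track_artifact("model.pkl", produced_by="train")
# print(record)
```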
Coping with ML training challenges
Checkpointing
- checkpoints include
- model architecture
- model weights
- training configurations
- optimizer
- two things to consider: how frequently to checkpoint and how many checkpoints to keep
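A checkpointing sketch in PyTorch (the framework choice is my assumption; the notes do not name one): save the model weights, optimizer state, and training configuration, then restore them to resume training.

```python
# Checkpointing sketch in PyTorch: persist model weights, optimizer state,
# and training configuration so training can resume after an interruption.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                  # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def save_checkpoint(path: str, epoch: int) -> None:
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "config": {"lr": 1e-3, "batch_size": 64},     # training configuration
        },
        path,
    )

def load_checkpoint(path: str) -> int:
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"]                            # resume from this epoch

save_checkpoint("ckpt_epoch_5.pt", epoch=5)
print(load_checkpoint("ckpt_epoch_5.pt"))                 # -> 5
```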
Distributed training strategies
-> scaling challenges: increasing training data volume, or increasing model size and complexity
- Distributed training = split training load across multiple compute nodes or clusters (CPU/GPU)
- there are two strategies
- data parallelism: training data split up + model replicated on all nodes
- model parallelism: training data replicated + model split up on all nodes
- which to choose: data parallelism when the model fits on a single device but the data is large; model parallelism when the model itself is too big to fit on one device (a data-parallel sketch follows)
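A condensed data-parallel training sketch using PyTorch DistributedDataParallel (again my assumption about the framework): the model is replicated in every process while each process trains on its own shard of the data.

```python
# Data parallelism sketch with PyTorch DDP: the model is replicated in every
# process, and DistributedSampler gives each process a different data shard.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def train_ddp():
    dist.init_process_group(backend="gloo")              # "nccl" for GPU clusters
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)                 # shards the data per process
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = DDP(nn.Linear(10, 1))                         # model replicated on each process
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                          # reshuffle shards each epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                               # gradients averaged across processes
            optimizer.step()
    dist.destroy_process_group()

# Launch with: torchrun --nproc_per_node=4 this_script.py
if __name__ == "__main__":
    train_ddp()
```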
Model Integration
= integrating models with ML applications